Project Proposal

Shark Incidents in California

Author

Luis Bracho, Diego Farah, Diego Paredes

Published

November 25, 2024

Topic and Research Question

The topic of this project is the exploration of shark incidents in California based on a dataset provided by the California Department of Fish and Wildlife. The research questions will focus on identifying patterns in the data rather than establishing causality. For example, some of the research questions include:

  1. Which species of sharks are most commonly involved in incidents?

  2. Is there a specific activity that is more prone to shark incidents?

  3. What types of injuries are most common in shark incidents? If a shark attacks you, will it kill you for sure?

  4. Is there a town or city where attacks are more common?

  5. Is there a depth at which most incidents occur?

  6. Is there a month or time when more accidents occur?

  7. What percentage of people attend the beach per day, and what percentage of beach visitors bathe?

  8. What is the number of incidents per capita?

  9. Are sharks really a danger to humans?

Introduction

Shark incidents have long fascinated researchers and the public alike, given the potentially fatal nature of these encounters and their connection to popular beach activities. This project aims to explore the characteristics of shark incidents in California, focusing on the types of injuries sustained (fatal, major, minor or none). Using the data, we want to estimate the probability that a person attacked by a shark dies.

We also want to explore the activities most common during attacks (scuba diving, swimming, surfing and others) and whether particular shark species are more likely to be involved. The study will provide valuable exploratory insights into the patterns of these incidents, which could help inform safety measures for coastal activities. By analyzing historical data on shark incidents, this research will help understand the frequency of attacks and potentially identify the environmental and behavioral factors that correlate with different injury types.

Data Sources

The primary data source for this project is the California Department of Fish and Wildlife’s dataset on shark incidents, last updated in March 2024. This dataset includes a wide range of variables, such as the date, time, location, water depth, type of human activity, species of shark involved, and the severity of injuries. The data is original and directly collected from incidents reported in California, ensuring its reliability. However, there may be some concerns related to incomplete data entries or inconsistencies in how certain variables were recorded. These issues will be addressed through data cleaning and preparation. For example, the depth and injury fields will need to be standardized, and any missing or ambiguous entries will be carefully handled to maintain the integrity of the analysis.

Link 1: https://wildlife.ca.gov/Conservation/Marine/White-Shark.

Link 2: https://wildlife.ca.gov/Data/Sci-Data.

Link 3: https://catalog.data.gov/dataset/shark-incident-database-california-56167.

Link 4: https://animalbiotelemetry.biomedcentral.com/articles/10.1186/2050-3385-1-2.

Link 4 is particularly important. It presents a study titled “Two-year migration of adult female white sharks (Carcharodon carcharias) reveals widely separated nursery areas and conservation concerns”, published in Animal Biotelemetry, which investigates the migratory patterns and ecological significance of adult female white sharks. Using satellite tracking data collected over two years, the research highlights the extensive movement and habitat usage of these apex predators.

Key findings of the study include:

Identification of Migration Routes: Adult female white sharks exhibit long-distance migratory behavior, connecting distinct geographic regions, including widely separated nursery areas.

Anticipated Results

In this analysis, we expect to examine two key variables: the mode of activity during the shark incident and the type of injury sustained. Based on the dataset, it is anticipated that activities like surfing and swimming will show a higher frequency of incidents. Additionally, injuries will likely range from minor to fatal, with surfing-related activities potentially resulting in more severe outcomes. We anticipate that these variables will be distributed in a way that shows many patterns, where certain activities dominate the data. For example, it’s expected that shark incidents are concentrated in shallow waters where swimming and surfing take place. We also expect to find relationships between shark species and injury severity, where species like the Great White Shark may be more commonly associated with fatal incidents.

The first chart illustrates the geographical distribution of shark incidents along the California coast, with a focus on the “Red Triangle” where incidents take place. This helps visualize the relationship between locations and incident frequency. The second chart shows shark incidents by activity type across different years, which helps to visualize the expected distribution of incidents across different activities and how this may change over time.

The distribution of activity types and injury severities will inform our understanding of which activities are most prone to shark incidents and how dangerous these incidents can be. The relationships between activity type and injury type will help answer research questions about what factors contribute to the likelihood and severity of shark-related injuries.

The main goal of this study is to demystify the danger that sharks pose to humans and to reassure those who still consider them a threat to society. We will try to do this with real data obtained from government databases, which we describe below.

Data Cleaning

Dataset 1 Cleaning (Sharks Incidents)

This dataset contains information on shark incidents in California.

First, let's load the data in R and look at the first 6 observations.

Code
head(data)
#> # A tibble: 6 × 14
#>   IncidentNum Date                Time        County Location Mode  Injury Depth
#>   <chr>       <dttm>              <chr>       <chr>  <chr>    <chr> <chr>  <chr>
#> 1 1           1950-10-08 00:00:00 0.5         San D… Imperia… Swim… major  surf…
#> 2 2           1952-05-27 00:00:00 0.58333333… San D… Imperia… Swim… minor  surf…
#> 3 3           1952-12-07 00:00:00 0.58333333… Monte… Lovers … Swim… fatal  surf…
#> 4 4           1955-02-06 00:00:00 0.5         Monte… Pacific… Free… minor  surf…
#> 5 5           1956-08-14 00:00:00 0.6875      San L… Pismo B… Swim… major  surf…
#> 6 6           1957-04-28 00:00:00 0.5625      San L… Morro B… Swim… fatal  surf…
#> # ℹ 6 more variables: Species <chr>, Comment <chr>, Longitude <chr>,
#> #   Latitude <dbl>, `Confirmed Source` <chr>, `WFL Case #` <chr>
Code
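# Drop the last 9 rows, which carry no incident information
data <- data[1:(nrow(data) - 9), ]
# Drop columns 10 to 14 (Comment, Longitude, Latitude, Confirmed Source, WFL Case #)
data <- data[, -c(10:14)]

The last 9 rows do not provide any information, so they are removed, together with the five columns that are not used in the analysis.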

For further processing we convert the variables to the desired types. For example, R reads Species, County and Mode as character, but it is more convenient to treat them as factors.

Code
data$IncidentNum = as.numeric(data$IncidentNum)
data$Time = as.numeric(data$Time)
data$Mode = as.factor(data$Mode)
data$County = as.factor(data$County)
data$Location = as.factor(data$Location)
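
The Time values shown in the first rows (0.5, 0.5625, 0.6875) look like fractions of a 24-hour day, as spreadsheet software stores clock times. Under that assumption, which the dataset documentation should confirm, a small helper can turn them back into readable clock times. The function name `fraction_to_clock` is our own and is only a sketch:

```r
# Assumption: Time is a fraction of a 24-hour day (0.5 = 12:00).
# fraction_to_clock() is a hypothetical helper, not part of the original analysis.
fraction_to_clock <- function(x) {
  minutes <- round(x * 24 * 60)                      # minutes since midnight
  clock <- sprintf("%02d:%02d", minutes %/% 60, minutes %% 60)
  ifelse(is.na(x), NA_character_, clock)             # keep missing times missing
}

fraction_to_clock(c(0.5, 0.5625, 0.6875))
```

With a clock-time column available, the question of the time of day at which incidents occur could be explored in the same way as the monthly pattern.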

Some observations are recorded inconsistently. To answer our questions properly, these inconsistencies must be resolved first.

Code
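library(lubridate)  # provides year() and month()

# Depth: surface incidents become 0; "underwater"/"submerged" entries have no
# measured depth and are assigned a nominal value of 5
data$Depth <- ifelse(data$Depth == "surface" | data$Depth == "surface*", 0, data$Depth)
data$Depth <- ifelse(data$Depth == "underwater" | data$Depth == "submerged" | data$Depth == "submerged*", 5, data$Depth)
# Injury: strip the trailing asterisk from flagged labels
data$Injury <- ifelse(data$Injury == "minor*", "minor", data$Injury)
data$Injury <- ifelse(data$Injury == "major*", "major", data$Injury)
data$Injury <- ifelse(data$Injury == "none*", "none", data$Injury)
data$Injury <- as.factor(data$Injury)
data$Depth <- as.numeric(data$Depth)
data$Species <- as.factor(data$Species)
# Drop rows without an incident number and add a running index
data <- data[!is.na(data$IncidentNum), ]
data$number <- seq_len(nrow(data))
# Derive calendar fields from the incident date
data$Year <- year(data$Date)
data$Month <- month(data$Date, label = TRUE, abbr = FALSE)
data$Day <- weekdays(data$Date)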
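
Because `as.numeric()` silently turns any unrecognized label into `NA`, it can be worth listing the depth labels that the recoding does not cover before converting. The sketch below uses a made-up vector (`depth_raw`) rather than the real column, and the label `"unknown"` is invented for the example:

```r
# Toy input for illustration only; "unknown" is a hypothetical unhandled label.
depth_raw <- c("surface", "15", "submerged*", "20", "unknown", NA)

# Labels that the cleaning step above recodes explicitly
known_labels <- c("surface", "surface*", "underwater", "submerged", "submerged*")

# Try the numeric conversion quietly and see which non-missing entries fail
as_number <- suppressWarnings(as.numeric(depth_raw))
unhandled <- unique(depth_raw[is.na(as_number) &
                              !is.na(depth_raw) &
                              !(depth_raw %in% known_labels)])
unhandled
```

Any label that shows up in `unhandled` would otherwise become a missing depth without warning.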

Let’s summarize our dataset:

Code
summary(data)
#>   IncidentNum          Date                              Time       
#>  Min.   :  1.00   Min.   :1950-10-08 00:00:00.000   Min.   :0.2812  
#>  1st Qu.: 53.25   1st Qu.:1985-03-14 06:00:00.000   1st Qu.:0.4167  
#>  Median :104.50   Median :2004-09-10 00:00:00.000   Median :0.5000  
#>  Mean   :103.81   Mean   :1997-12-23 18:10:41.583   Mean   :0.5316  
#>  3rd Qu.:154.75   3rd Qu.:2013-09-27 00:00:00.000   3rd Qu.:0.6380  
#>  Max.   :205.00   Max.   :2022-02-26 00:00:00.000   Max.   :0.9583  
#>                                                     NA's   :18      
#>            County                     Location                    Mode   
#>  San Diego    :23   Salmon Creek Beach    :  9   Surfing / Boarding :80  
#>  Santa Barbara:19   Farallon Islands      :  7   Freediving         :35  
#>  Humboldt     :18   Tomales Point         :  7   Kayaking / Canoeing:29  
#>  San Mateo    :18   Moonstone Beach       :  5   Swimming           :22  
#>  Marin        :16   San Onofre State Beach:  5   Scuba Diving       :19  
#>  Monterey     :15   La Jolla              :  4   Hookah Diving      :10  
#>  (Other)      :93   (Other)               :165   (Other)            : 7  
#>    Injury       Depth              Species        number            Year     
#>  fatal:15   Min.   : 0.000   White     :179   Min.   :  1.00   Min.   :1950  
#>  major:59   1st Qu.: 0.000   Unknown   : 13   1st Qu.: 51.25   1st Qu.:1985  
#>  minor:49   Median : 0.000   Hammerhead:  3   Median :101.50   Median :2004  
#>  none :79   Mean   : 3.322   Blue      :  2   Mean   :101.50   Mean   :1997  
#>             3rd Qu.: 0.000   Leopard   :  2   3rd Qu.:151.75   3rd Qu.:2013  
#>             Max.   :72.000   Salmon    :  1   Max.   :202.00   Max.   :2022  
#>             NA's   :3        (Other)   :  2                                  
#>        Month        Day           
#>  October  :36   Length:202        
#>  August   :31   Class :character  
#>  September:31   Mode  :character  
#>  July     :23                     
#>  May      :16                     
#>  November :16                     
#>  (Other)  :49

Dataset 2 Cleaning (Beaches Attendance)

The data2 dataset spells some county names differently from the data dataset. To compare the two datasets and combine their information, the county names must match.

Code
# Keep the first seven columns and the California rows only
data2 <- data2[, -c(8:66)]
data2 <- data2[which(data2$State == "California"), ]
# Attendance is stored as text with thousands separators
data2$Attendance <- as.numeric(gsub(",", "", data2$Attendance))
# Fix a misspelled county, then drop the "County" suffix and surrounding whitespace
data2$County <- ifelse(data2$County == "Sanoma County", "Sonoma", data2$County)
data2$County <- gsub("county", "", data2$County, ignore.case = TRUE)
data2$County <- factor(trimws(data2$County))
unique(data2$County)
#>  [1] Los Angeles                      Santa Cruz                      
#>  [3] California State Parks           San Diego                       
#>  [5] Orange                           Alameda                         
#>  [7] Santa Barbara                    Ventura                         
#>  [9] East Bay Regional Parks District San Francisco                   
#> [11] Marin                            Humboldt                        
#> [13] San Luis Obispo                  Santa Clara                     
#> [15] Sonoma                          
#> 15 Levels: Alameda California State Parks ... Ventura
Code
names(data2)
#> [1] "Year"        "Region"      "Agency.Name" "City"        "County"     
#> [6] "State"       "Attendance"
Code
dim(data2)
#> [1] 1481    7

Evaluating our proposal expectations

1. Is there a lethal shark, or a species that tries to attack more than others?

First we create a pie chart to see the most common species involved in attacks.

Code
species_counts <- data %>%
  count(Species) %>%
  mutate(percentage = n / sum(n) * 100,
         label = paste0(Species, " (", round(percentage, 1), "%)"))

fig <- plot_ly(data = species_counts,labels = ~Species,values = ~n,
  textinfo = "label+percent",hoverinfo = "text",text = ~label, type = "pie",
  marker = list(colors = RColorBrewer::brewer.pal(n = nrow(species_counts), "Set3")))

fig <- fig %>% layout(
    title = "Distribution of Shark Species",showlegend = TRUE)

fig

We see that the white shark is by far the most common species.

Code
species_counts
#> # A tibble: 8 × 4
#>   Species        n percentage label            
#>   <fct>      <int>      <dbl> <chr>            
#> 1 Blue           2      0.990 Blue (1%)        
#> 2 Hammerhead     3      1.49  Hammerhead (1.5%)
#> 3 Leopard        2      0.990 Leopard (1%)     
#> 4 Salmon         1      0.495 Salmon (0.5%)    
#> 5 Sevengill      1      0.495 Sevengill (0.5%) 
#> 6 Thresher       1      0.495 Thresher (0.5%)  
#> 7 Unknown       13      6.44  Unknown (6.4%)   
#> 8 White        179     88.6   White (88.6%)

The breakdown of species by injury type is revealing: white sharks are the only species present in fatal incidents and in those with major injuries, which suggests that it is the only species people have real reason to fear.
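
A species-by-injury cross-tabulation is one way to check a statement like this directly. The sketch below runs on a handful of invented rows, not on the project's data, and only illustrates the mechanics. On the real data the same two calls would be applied to `data$Species` and `data$Injury`:

```r
# Hypothetical toy rows for illustration only; these are NOT the project's data.
toy <- data.frame(
  Species = c("White", "White", "White", "Blue", "Unknown", "White"),
  Injury  = c("fatal", "major", "none", "minor", "none", "minor")
)

# Rows are species, columns are injury types
species_by_injury <- table(toy$Species, toy$Injury)
species_by_injury

# Species that appear in at least one fatal incident
fatal_species <- rownames(species_by_injury)[species_by_injury[, "fatal"] > 0]
fatal_species
```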

2. What is the relationship between the activity and the injury type?

To answer this question we created a bar plot for each type of injury, showing which activities are involved in each type of injury.

Code
conTable = table(data$Mode, data$Injury)
conTable
#>                      
#>                       fatal major minor none
#>   Freediving              3    17     9    6
#>   Hookah Diving           1     4     2    3
#>   Kayaking / Canoeing     1     0     4   24
#>   Paddleboarding          0     0     0    7
#>   Scuba Diving            0    11     5    3
#>   Surfing / Boarding      5    20    21   34
#>   Swimming                5     7     8    2
#>   Walking in shallow      0     0     0    0
Code
injury_mode_count <- data %>% 
  count(Mode, Injury) %>% 
  rename(injury_count = 'n')

injury_mode_count
#> # A tibble: 23 × 3
#>    Mode                Injury injury_count
#>    <fct>               <fct>         <int>
#>  1 Freediving          fatal             3
#>  2 Freediving          major            17
#>  3 Freediving          minor             9
#>  4 Freediving          none              6
#>  5 Hookah Diving       fatal             1
#>  6 Hookah Diving       major             4
#>  7 Hookah Diving       minor             2
#>  8 Hookah Diving       none              3
#>  9 Kayaking / Canoeing fatal             1
#> 10 Kayaking / Canoeing minor             4
#> # ℹ 13 more rows
Code
# Order activities by the total number of incidents
total_incidents_per_mode <- injury_mode_count %>%
  group_by(Mode) %>%
  summarise(total_incidents = sum(injury_count)) %>%
  arrange(desc(total_incidents))

injury_mode_count$Mode <- factor(injury_mode_count$Mode, 
                                 levels = total_incidents_per_mode$Mode)

# Create the faceted bar chart with adjusted label positions
p<-ggplot(injury_mode_count, aes(x = Mode, y = injury_count, fill = Mode)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = injury_count), hjust = 2, size = 5) + 
  coord_flip() +
  labs(title = "Injury Types by Activity",
       x = "Activity",
       y = "Number of Incidents") +
  scale_fill_viridis_d(option = "plasma", guide = "none") + 
  facet_wrap(~ Injury, scales = "free_y") + 
  expand_limits(y = max(injury_mode_count$injury_count) * 1.1) +
  theme_minimal() +
  theme(plot.title = element_text(color="coral",face = "bold.italic"),
        axis.title.x = element_text(size=14, face="bold"),
        axis.title.y = element_text(size=14, face="bold"),
        axis.text.x = element_text(size = 10,face="bold"),
        axis.text.y = element_text(size = 8),
        strip.text = element_text(size = 12, face = "bold"))

ggplotly(p)

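We see that the most dangerous activities take place at or near the surface. Surfing / boarding accounts for the most incidents (80), followed by freediving (35), and the fatal incidents are concentrated in surfing (5), swimming (5) and freediving (3). Scuba diving recorded 19 incidents, none of them fatal.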
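
Raw counts favor the activities with the most incidents, so it can help to read the contingency table row-wise as well: of the incidents recorded for each activity, what share was fatal? The sketch below re-enters the counts printed in the table above (leaving out the empty "Walking in shallow" row) and uses `prop.table()`. It is an illustration, not part of the original analysis:

```r
# Counts copied from the Mode x Injury table above
counts <- matrix(
  c(3, 17,  9,  6,
    1,  4,  2,  3,
    1,  0,  4, 24,
    0,  0,  0,  7,
    0, 11,  5,  3,
    5, 20, 21, 34,
    5,  7,  8,  2),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Freediving", "Hookah Diving", "Kayaking / Canoeing", "Paddleboarding",
      "Scuba Diving", "Surfing / Boarding", "Swimming"),
    c("fatal", "major", "minor", "none")
  )
)

# Share of each injury type within an activity (each row sums to 1)
row_share <- prop.table(counts, margin = 1)

# Percentage of incidents that were fatal, by activity, highest first
fatal_share <- sort(round(100 * row_share[, "fatal"], 1), decreasing = TRUE)
fatal_share
```

Read this way, swimming has the highest share of fatal outcomes even though surfing has the most incidents overall. These shares are computed on small counts and say nothing about how many people practice each activity.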

3. What types of injuries are most common in shark incidents? If a shark attacks you, will it kill you for sure?

To answer this we created a waffle chart showing the proportion of each injury type. We focus on the number of fatal incidents and on the incidents that caused no injury.

Code
injury_counts <- data %>%
  count(Injury) %>%
  mutate(percentage = n / sum(n) * 100,
         label = paste0(round(percentage, 1), "%")) 

injury_counts_vector <- setNames(injury_counts$n, injury_counts$Injury)

waffle_chart <- waffle(injury_counts_vector, rows = 10, 
                       colors = RColorBrewer::brewer.pal(n = length(injury_counts$Injury), "Set3"),
                       title = "Distribution of Injury Types")

total_cells <- sum(injury_counts_vector)

waffle_data <- expand.grid(row = 1:10, col = 1:ceiling(total_cells / 10))
waffle_data <- waffle_data[1:total_cells, ]
waffle_data$group <- rep(names(injury_counts_vector), injury_counts_vector)

colors <- RColorBrewer::brewer.pal(n = length(injury_counts$Injury), "Set3")
waffle_data$color <- unlist(lapply(waffle_data$group, function(g) {
  colors[which(names(injury_counts_vector) == g)]
}))

fig <- plot_ly(data = waffle_data,x = ~col,y = ~row,color = ~group,
               colors = colors,type = "scatter",mode = "markers",
            marker = list(size = 15, symbol = "square")) %>%
  layout(title = "Distribution of Injury Types",
        xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
        yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

fig

We see that a shark attack is far from always fatal: 15 of the 202 incidents (about 7%) were fatal, and about 39% caused no injury at all.
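
To give a sense of the uncertainty around the fatal share, one option is an exact binomial confidence interval from base R's `binom.test()`. The counts (15 fatal incidents out of 202) come from the summary above. Treating the incidents as independent draws with a common fatality probability is our simplifying assumption, so the interval is only a rough guide:

```r
# Counts taken from the Injury column of the summary above
fatal <- 15
total <- 202

estimate <- fatal / total

# Exact (Clopper-Pearson) 95% interval; assumes independent incidents
ci <- binom.test(fatal, total)$conf.int

round(100 * c(estimate = estimate, lower = ci[1], upper = ci[2]), 1)
```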

4. Is there a town or city where attacks are more common?

To answer this we create a bar chart.

Code
Table = table(data$County, data$Injury)
Table
#>                         
#>                          fatal major minor none
#>   Del Norte                  0     0     2    1
#>   Humboldt                   0     7     2    9
#>   Island - Catalina          0     0     1    3
#>   Island - Farallones        0     7     0    0
#>   Island - San Miguel        1     2     2    0
#>   Island - San Nicolas       0     0     1    0
#>   Island - Santa Barbara     0     0     0    0
#>   Island - Santa Cruz        0     0     1    1
#>   Island - Santa Rosa        0     1     0    0
#>   Los Angeles                1     0     6    2
#>   Marin                      0     9     4    3
#>   Mendocino                  1     3     1    0
#>   Monterey                   2     8     2    3
#>   Orange                     0     1     2    5
#>   San Diego                  2     4     8    9
#>   San Francisco              1     0     0    1
#>   San Luis Obispo            3     3     1    7
#>   San Mateo                  1     1     4   12
#>   Santa Barbara              2     2     6    9
#>   Santa Cruz                 1     3     3    8
#>   Sonoma                     0     8     1    6
#>   Ventura                    0     0     2    0
Code
county_counts <- data %>%
  count(County) %>%
  arrange(desc(n))

p <- ggplot(county_counts, aes(x = reorder(County, n), y = n, fill = County)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = n), hjust = 3, size = 5.5, color = "black") +
  coord_flip() +
  labs(title = "Shark Incidents per County",
       x = "County",y = "Number of Incidents") +
  theme_minimal() +
  theme(plot.title = element_text(color="coral",size = 16, face = "bold.italic"),
        axis.title.x = element_text(size=14, face="bold"),
        axis.title.y = element_text(size=14, face="bold"),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        legend.position = "none")

ggplotly(p)

Here we can see that the county with the largest number of incidents is San Diego, with 23 recorded. San Francisco, for example, has very few (2), and Los Angeles, which is a much more populous county, has 9.

5. Is there a depth at which most incidents occur?

Code
Table_depth = table(data$Depth, data$Injury)
Table_depth
#>     
#>      fatal major minor none
#>   0     14    43    44   71
#>   5      0     1     0    1
#>   10     0     1     0    0
#>   12     0     1     0    0
#>   14     0     0     0    1
#>   15     0     1     1    1
#>   18     0     2     0    0
#>   19     0     1     0    0
#>   20     0     3     0    1
#>   25     0     1     1    2
#>   28     0     1     0    0
#>   30     0     0     0    1
#>   38     0     1     0    0
#>   40     0     2     1    0
#>   47     0     0     1    0
#>   72     0     0     1    0

We see that the great majority of incidents, both fatal and non-fatal, were recorded for activities carried out at the surface: 172 of the 199 incidents with a recorded depth, including all 14 fatal incidents for which a depth is known. This suggests, for example, that diving at depth is a comparatively safe activity as far as the danger of being attacked by a shark is concerned.

6. Is there a month or time when more accidents occur?

Code
# Extract month from the Date column and create a month factor
data$Month <- format(as.Date(data$Date, format = "%Y-%m-%d"), "%B")
data$Month <- factor(data$Month, levels = month.name, ordered = TRUE)

# Count the number of incidents per month
month_counts <- data %>%
  count(Month) %>%
  arrange(Month)

# Plot the number of shark incidents by month
p<-ggplot(month_counts, aes(x = Month, y = n, group = 1)) +
  geom_line(color = "darkred", size = 1) +
  geom_point(color = "darkred", size = 2) +
  labs(title = "Shark Incidents by Month",
       x = "Month",
       y = "Number of Incidents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(color = "Coral",size = 16, face = "bold"),
        axis.title.x = element_text(size = 16, face = "bold"),
        axis.title.y = element_text(size = 16, face = "bold"))

ggplotly(p)

According to “Two-year migration of adult female white sharks (Carcharodon carcharias) reveals widely separated nursery areas and conservation concerns”, the white shark mating season occurs in spring or summer, in temperate waters. We conclude that the highest number of incidents occurs at the end of the mating season.

There is an even more important point, however. We have ample evidence that by far the highest number of incidents occurs during surfing, so it is important to know when this activity is practiced. According to information taken from… the month with the best weather for surfing is October. Other activities with many incidents are freediving (apnea) and swimming, which for reasons of weather and vacations are practiced in summer, from June to October. This is consistent with the time series graph shown above.

Attendance and Per capita Incidents

7. What percentage of people attend the beach per day, and what percentage of beach visitors bathe?

To estimate how many people attended the beach in a given period, we use the estimates from the study “The Human Tide: Beach Attendance and Bathing Rates for Southern California Beaches”. The study reports the share of annual attendance that falls in each month. These shares were obtained for the year 2007, but we use them to estimate the other years as well, because comparable data for other years is hard to find:

Code
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
month_percentage <- c(4.6/119.3, 4/119.3, 6/119.3,7/119.3, 9/119.3, 16/119.3, 28/119.3, 24/119.3, 12/119.3, 5.4/119.3, 4/119.3, 4/119.3)

data_month_per_assistance <- data.frame(Month = month, month_percentage = month_percentage)
data_month_per_assistance
#>        Month month_percentage
#> 1    January       0.03855826
#> 2   February       0.03352892
#> 3      March       0.05029338
#> 4      April       0.05867561
#> 5        May       0.07544007
#> 6       June       0.13411567
#> 7       July       0.23470243
#> 8     August       0.20117351
#> 9  September       0.10058676
#> 10   October       0.04526404
#> 11  November       0.03352892
#> 12  December       0.03352892

The same study also provides the share of beach attendance by day of the week.

Code
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
# Shares of weekly attendance by day; they sum to 1
percentage <- c(0.10, 0.08, 0.09, 0.10, 0.15, 0.27, 0.21)

data_day_per_assistance <- data.frame(Day = days_of_week, day_percentage = percentage)
data_day_per_assistance
#>         Day day_percentage
#> 1    Monday           0.10
#> 2   Tuesday           0.08
#> 3 Wednesday           0.09
#> 4  Thursday           0.10
#> 5    Friday           0.15
#> 6  Saturday           0.27
#> 7    Sunday           0.21

The values above were taken from “Beach Attendance and Bathing Rates for Southern California Beaches”.
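
Day-of-week shares are meant to split weekly attendance among the seven days, so they should add up to 1. A short check of that property catches a misplaced decimal point early. The helper `check_shares` and both example vectors below are our own illustrations, not values from the study:

```r
# Hypothetical helper: TRUE when the shares add up to 1 within a tolerance
check_shares <- function(shares, tolerance = 0.01) {
  abs(sum(shares) - 1) <= tolerance
}

ok_example   <- c(0.5, 0.3, 0.2)  # toy shares that sum to 1
slip_example <- c(0.5, 3, 0.2)    # same shares with 0.3 mistyped as 3

check_shares(ok_example)
check_shares(slip_example)
```

The same call can be applied to the `percentage` vector defined above before it is used in any estimate.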

The study also reports the percentage of beach visitors who bathe, that is, who go into the sea.

Code
month <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
month_percentage <- c(0.26,.28,.33,.31,.41,.50,.52,.54,.50,.36,.29,.27)

data_month_per_bath <- data.frame(Month = month, month_percentage = month_percentage)
data_month_per_bath$month_asis_percentage <- data_month_per_bath$month_percentage*data_month_per_assistance$month_percentage
data_month_per_bath 
#>        Month month_percentage month_asis_percentage
#> 1    January             0.26           0.010025147
#> 2   February             0.28           0.009388097
#> 3      March             0.33           0.016596815
#> 4      April             0.31           0.018189438
#> 5        May             0.41           0.030930427
#> 6       June             0.50           0.067057837
#> 7       July             0.52           0.122045264
#> 8     August             0.54           0.108633697
#> 9  September             0.50           0.050293378
#> 10   October             0.36           0.016295054
#> 11  November             0.29           0.009723386
#> 12  December             0.27           0.009052808

The bathing percentages above were taken from “Beach Attendance and Bathing Rates for Southern California Beaches”.
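
One possible use of these shares is to set the monthly incident counts against the estimated share of bathers in each month. The sketch below takes the six months whose counts appear in the dataset summary (May, July, August, September, October, November) and divides them by the `month_asis_percentage` values printed above. The resulting index is relative only, and it assumes that the 2007 southern California shares apply to the whole state and to every year, which is a strong assumption:

```r
# Incident counts per month, copied from the dataset summary
incidents <- c(May = 16, July = 23, August = 31,
               September = 31, October = 36, November = 16)

# Estimated share of yearly bathers per month (month_asis_percentage above)
bather_share <- c(May = 0.030930427, July = 0.122045264, August = 0.108633697,
                  September = 0.050293378, October = 0.016295054,
                  November = 0.009723386)

# Incidents per unit of bathing exposure (relative index, not a probability)
exposure_index <- incidents / bather_share[names(incidents)]
round(sort(exposure_index, decreasing = TRUE))
```

Under these assumptions October stands out even more than in the raw counts, because many incidents coincide with comparatively few bathers.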

Dataset 1 covers the period from 1950 to 2022.

Code
summary(data$Year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1950    1985    2004    1997    2013    2022

Dataset 2 covers the period from 1964 to 2023, so we have no information on beach visits before 1964.

Code
summary(data2$Year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1964    1990    2003    2001    2014    2023

It is also important to note that the counties covered by the beach attendance data do not exactly match those in the incident data, so combining the two datasets is a challenge (as was finding the attendance data in the first place, which was very tedious).

Code
p<-ggplot(data, aes(x = reorder(County, -table(County)[County]))) +
   geom_bar(fill = "steelblue", color = "black") +
   labs(title = "Number of incidents per county",x = "Counties",y = "Frequency") +
   theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1),
         plot.title = element_text(color = "Coral",size = 16, face = "bold"),
         axis.title.x = element_text(size = 16, face = "bold"),
         axis.title.y = element_text(size = 15, face = "bold"))

ggplotly(p)
Code
p<-ggplot(data2, aes(x = reorder(County, -table(County)[County]))) +
   geom_bar(fill = "steelblue", color = "black") +
   labs(title = "Counties in dataset 2",x = "Counties",y = "Frequency") +
   theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1),
         plot.title = element_text(color = "Coral",size = 16, face = "bold"),
         axis.title.x = element_text(size = 16, face = "bold"),
         axis.title.y = element_text(size = 15, face = "bold"))

ggplotly(p)
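
The mismatch between the two sets of counties can also be listed explicitly with `setdiff()`. The two vectors below re-enter the county levels printed earlier in this report (the incident table for dataset 1 and the `unique(data2$County)` output for dataset 2). This is a sketch in base R and not part of the original analysis:

```r
# County levels of dataset 1, as printed in the County x Injury table
incident_counties <- c(
  "Del Norte", "Humboldt", "Island - Catalina", "Island - Farallones",
  "Island - San Miguel", "Island - San Nicolas", "Island - Santa Barbara",
  "Island - Santa Cruz", "Island - Santa Rosa", "Los Angeles", "Marin",
  "Mendocino", "Monterey", "Orange", "San Diego", "San Francisco",
  "San Luis Obispo", "San Mateo", "Santa Barbara", "Santa Cruz", "Sonoma",
  "Ventura"
)

# County levels of dataset 2, as printed by unique(data2$County)
attendance_counties <- c(
  "Los Angeles", "Santa Cruz", "California State Parks", "San Diego", "Orange",
  "Alameda", "Santa Barbara", "Ventura", "East Bay Regional Parks District",
  "San Francisco", "Marin", "Humboldt", "San Luis Obispo", "Santa Clara",
  "Sonoma"
)

# Counties with incidents but no attendance data, and the reverse
no_attendance <- setdiff(incident_counties, attendance_counties)
no_incidents  <- setdiff(attendance_counties, incident_counties)
no_attendance
no_incidents
```

Only the counties present in both lists can enter the per-capita calculation that follows.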

Next we calculate total beach attendance per year and per county.

Code
data3 <- data2 %>%
  group_by(Year, County) %>%
  summarise(total_attendance = sum(Attendance, na.rm = TRUE), .groups = "drop")

head(data3)
#> # A tibble: 6 × 3
#>    Year County      total_attendance
#>   <int> <fct>                  <dbl>
#> 1  1964 Orange               3669310
#> 2  1965 Los Angeles         13500000
#> 3  1965 Orange              12038943
#> 4  1966 Los Angeles         14000000
#> 5  1966 Orange              12855876
#> 6  1966 San Diego            3220540
Code
merged_data = merge(data3, data, by.x=c('Year', 'County'), 
                               by.y=c('Year', 'County'))
names(merged_data)
#>  [1] "Year"             "County"           "total_attendance" "IncidentNum"     
#>  [5] "Date"             "Time"             "Location"         "Mode"            
#>  [9] "Injury"           "Depth"            "Species"          "number"          
#> [13] "Month"            "Day"

8. What is the number of incidents per capita?

Addressing this question is quite complicated because data on the number of people at the beach on the exact days when incidents occurred is not readily available. Searching for this data day by day is an overwhelming task, so the question can only be answered for certain counties, using the information we were able to gather, after much effort, in dataset 2.

It is important to note that there is no attendance data for some beaches, especially on some islands. Therefore, incidents per capita will be studied only for the places where we do have the data.

Code
data4 <- merged_data %>%
  group_by(County) %>%
  summarise(attendance_county = sum(total_attendance, na.rm = TRUE), .groups = "drop")

head(data4)
#> # A tibble: 6 × 2
#>   County        attendance_county
#>   <fct>                     <dbl>
#> 1 Humboldt                      0
#> 2 Los Angeles           479493906
#> 3 Marin                   3100000
#> 4 Orange                271829516
#> 5 San Diego             598154151
#> 6 San Francisco            990000

Now we present the total incidents that occurred from 1950 to 2022 (these are the counts plotted above).

Code
county_counts <- data %>%
  count(County) %>%
  arrange(desc(n))

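Now we present the attendance total for each county. Note that data4 is built from merged_data, which has one row per incident, so it sums attendance over the county-years from 1964 onward in which an incident was recorded (a year is counted once per incident), not over every year in dataset 2.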

Code
data4
#> # A tibble: 11 × 2
#>    County          attendance_county
#>    <fct>                       <dbl>
#>  1 Humboldt                        0
#>  2 Los Angeles             479493906
#>  3 Marin                     3100000
#>  4 Orange                  271829516
#>  5 San Diego               598154151
#>  6 San Francisco              990000
#>  7 San Luis Obispo           6173998
#>  8 Santa Barbara            45762839
#>  9 Santa Cruz               23143415
#> 10 Sonoma                     817187
#> 11 Ventura                    392489
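
Since `merge()` performs an inner join by default, a table merged with the incident data holds one row per incident, and sums taken over it differ from sums taken over the yearly attendance table. The toy example below, with an invented county and invented numbers, shows the difference. It is an illustration of the join behavior, not a replacement for the calculation above:

```r
# Invented yearly attendance for a single toy county
attendance <- data.frame(
  Year = c(2000, 2001, 2002),
  County = "Toy County",
  total_attendance = c(100, 200, 300)
)

# Two invented incidents, both in 2001
incidents <- data.frame(
  Year = c(2001, 2001),
  County = "Toy County",
  Injury = c("none", "minor")
)

# Inner join: only 2001 survives, and it appears once per incident
merged <- merge(attendance, incidents, by = c("Year", "County"))

via_merge <- sum(merged$total_attendance)      # attendance summed over incident rows
direct    <- sum(attendance$total_attendance)  # attendance summed over all years
c(via_merge = via_merge, direct = direct)
```

Summing over the yearly table (here `direct`) gives total attendance for the whole period, which may be the more natural denominator for a per-capita rate.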

We are now in a position to calculate incidents per capita in each county. We cannot do so for all counties because attendance data is not available for all of them, but we do have it for the most important ones. The document “Beach Attendance and Bathing Rates for Southern California Beaches” estimates that about 45% of beach visitors actively engage in recreational water contact annually. Since it makes no sense to calculate incidents per capita for people who do not go into the sea, we scale attendance by this factor.

Code
county_counts <- county_counts %>%
  mutate(County = trimws(tolower(County)))

data4 <- data4 %>%
  mutate(County = trimws(tolower(County)))

merged_data <- county_counts %>%
  inner_join(data4, by = "County")

merged_data <- merged_data %>%
  mutate(attendance_contact_water = 0.45*attendance_county)

merged_data <- merged_data %>%
  mutate(per_capita_incident = n / attendance_contact_water)
  
merged_data[-c(3), ]  # drop row 3 (Humboldt): its recorded attendance is 0, so the rate is undefined
#> # A tibble: 10 × 5
#>    County         n attendance_county attendance_contact_w…¹ per_capita_incident
#>    <chr>      <int>             <dbl>                  <dbl>               <dbl>
#>  1 san diego     23         598154151             269169368.        0.0000000854
#>  2 santa bar…    19          45762839              20593278.        0.000000923 
#>  3 marin         16           3100000               1395000         0.0000115   
#>  4 santa cruz    15          23143415              10414537.        0.00000144  
#>  5 sonoma        15            817187                367734.        0.0000408   
#>  6 san luis …    14           6173998               2778299.        0.00000504  
#>  7 los angel…     9         479493906             215772258.        0.0000000417
#>  8 orange         8         271829516             122323282.        0.0000000654
#>  9 san franc…     2            990000                445500         0.00000449  
#> 10 ventura        2            392489                176620.        0.0000113   
#> # ℹ abbreviated name: ¹​attendance_contact_water

As we can see, the incidents per capita are tiny, with a maximum of 0.0000408 incidents per water-contact visit (Sonoma), followed by Marin (0.0000115) and Ventura (0.0000113). This meets the main goal of this research: based on these figures, we can say that it is extremely unlikely for a shark to kill someone.
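
Rates this small are easier to read as "one incident in N water-contact visits". The sketch below inverts a few of the `per_capita_incident` values printed in the table above. The values are copied by hand and the conversion is simple arithmetic, so treat it as an illustration:

```r
# per_capita_incident values copied from the table above
rate <- c(sonoma = 0.0000408, marin = 0.0000115,
          ventura = 0.0000113, `san diego` = 0.0000000854)

# Number of water-contact visits per recorded incident
one_in <- round(1 / rate)
one_in
```

Even in the county with the highest rate, this corresponds to roughly one recorded incident per 24,500 estimated water-contact visits, and most recorded incidents are not fatal.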